Project Title : Cybersecurity Attack Analysis¶

About the Dataset :¶

Incribo's synthetic cyber dataset is a collection of 40,000 records with 25 fields. The data is designed to represent realistic network traffic history, making it a valuable resource for cybersecurity analysis tasks. Analysts can use the dataset to assess heatmaps, attack signatures, and other types of cybersecurity data.

Cybersecurity Dataset¶

Column Name             Description
Timestamp               Date and time of the internet activity
Source IP Address       Internet address of the sender
Destination IP Address  Internet address of the receiver
Source Port             Number used by the sender to send information
Destination Port        Number used by the receiver to get information
Protocol                Language used by the devices to talk to each other (e.g., TCP, UDP, ICMP)
Packet Length           Size of the information package sent over the internet
Packet Type             Kind of information package (e.g., regular message, control message)
Traffic Type            Type of internet activity (e.g., browsing websites, sending emails)
Payload Data            The actual content sent over the internet
Malware Indicators      Signs that something bad (malware) might be trying to sneak in
Anomaly Scores          Numbers showing unusual activity compared to normal internet use
Alerts/Warnings         Notifications from security systems saying something suspicious might be happening
Attack Type             Kind of cyberattack that was done or might be happening (e.g., overwhelming a system with traffic, stealing information)
Attack Signature        Unique fingerprint of a known cyberattack
Action Taken            What was done to stop the threat
Severity Level          How serious the threat was (e.g., not serious, kind of serious, very serious)
User Information        Details about the person using the internet
Device Information      Details about the computer or phone being used
Network Segment         Part of the network where the activity happened
Geo-location Data       Location information based on internet addresses
Proxy Information       Details about any relays used to connect to the internet
Firewall Logs           Records of what the security system allowed or blocked on the internet
IDS/IPS Alerts          Notifications from systems that watch for cyberattacks
Log Source              Where the information came from (e.g., security software, router)

Importing the Required Libraries¶

In [ ]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
In [ ]:
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

Load the Dataset¶

In [ ]:
df = pd.read_csv("cybersecurity_attacks.csv")
In [ ]:
df.head(5)
Out[ ]:
Timestamp Source IP Address Destination IP Address Source Port Destination Port Protocol Packet Length Packet Type Traffic Type Payload Data ... Action Taken Severity Level User Information Device Information Network Segment Geo-location Data Proxy Information Firewall Logs IDS/IPS Alerts Log Source
0 2023-05-30 06:33:58 103.216.15.12 84.9.164.252 31225 17616 ICMP 503 Data HTTP Qui natus odio asperiores nam. Optio nobis ius... ... Logged Low Reyansh Dugal Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... Segment A Jamshedpur, Sikkim 150.9.97.135 Log Data NaN Server
1 2020-08-26 07:08:30 78.199.217.198 66.191.137.154 17245 48166 ICMP 1174 Data HTTP Aperiam quos modi officiis veritatis rem. Omni... ... Blocked Low Sumer Rana Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... Segment B Bilaspur, Nagaland NaN Log Data NaN Firewall
2 2022-11-13 08:23:25 63.79.210.48 198.219.82.17 16811 53600 UDP 306 Control HTTP Perferendis sapiente vitae soluta. Hic delectu... ... Ignored Low Himmat Karpe Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... Segment C Bokaro, Rajasthan 114.133.48.179 Log Data Alert Data Firewall
3 2023-07-02 10:38:46 163.42.196.10 101.228.192.255 20018 32534 UDP 385 Data HTTP Totam maxime beatae expedita explicabo porro l... ... Blocked Medium Fateh Kibe Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ... Segment B Jaunpur, Rajasthan NaN NaN Alert Data Firewall
4 2023-07-16 13:11:07 71.166.185.76 189.243.174.238 6131 26646 TCP 1462 Data DNS Odit nesciunt dolorem nisi iste iusto. Animi v... ... Blocked Low Dhanush Chad Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... Segment C Anantapur, Tripura 149.6.110.119 NaN Alert Data Firewall

5 rows × 25 columns

Exploratory Data Analysis¶

In [ ]:
# List Columns
df.columns
Out[ ]:
Index(['Timestamp', 'Source IP Address', 'Destination IP Address',
       'Source Port', 'Destination Port', 'Protocol', 'Packet Length',
       'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
       'Anomaly Scores', 'Alerts/Warnings', 'Attack Type', 'Attack Signature',
       'Action Taken', 'Severity Level', 'User Information',
       'Device Information', 'Network Segment', 'Geo-location Data',
       'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source'],
      dtype='object')
In [ ]:
# Shape of data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in the Cybersecurity dataset")
There are 40000 rows and 25 columns in the Cybersecurity dataset
In [ ]:
# Dataset Info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Timestamp               40000 non-null  object 
 1   Source IP Address       40000 non-null  object 
 2   Destination IP Address  40000 non-null  object 
 3   Source Port             40000 non-null  int64  
 4   Destination Port        40000 non-null  int64  
 5   Protocol                40000 non-null  object 
 6   Packet Length           40000 non-null  int64  
 7   Packet Type             40000 non-null  object 
 8   Traffic Type            40000 non-null  object 
 9   Payload Data            40000 non-null  object 
 10  Malware Indicators      20000 non-null  object 
 11  Anomaly Scores          40000 non-null  float64
 12  Alerts/Warnings         19933 non-null  object 
 13  Attack Type             40000 non-null  object 
 14  Attack Signature        40000 non-null  object 
 15  Action Taken            40000 non-null  object 
 16  Severity Level          40000 non-null  object 
 17  User Information        40000 non-null  object 
 18  Device Information      40000 non-null  object 
 19  Network Segment         40000 non-null  object 
 20  Geo-location Data       40000 non-null  object 
 21  Proxy Information       20149 non-null  object 
 22  Firewall Logs           20039 non-null  object 
 23  IDS/IPS Alerts          19950 non-null  object 
 24  Log Source              40000 non-null  object 
dtypes: float64(1), int64(3), object(21)
memory usage: 7.6+ MB

Examining the Null and Missing Values¶

Let's check for missing data! Understanding null and missing values is important for accurate analysis.

In [ ]:
df.isnull().sum().sort_values(ascending=False)
Out[ ]:
Alerts/Warnings           20067
IDS/IPS Alerts            20050
Malware Indicators        20000
Firewall Logs             19961
Proxy Information         19851
Attack Type                   0
Geo-location Data             0
Network Segment               0
Device Information            0
User Information              0
Severity Level                0
Action Taken                  0
Attack Signature              0
Timestamp                     0
Source IP Address             0
Anomaly Scores                0
Payload Data                  0
Traffic Type                  0
Packet Type                   0
Packet Length                 0
Protocol                      0
Destination Port              0
Source Port                   0
Destination IP Address        0
Log Source                    0
dtype: int64
In [ ]:
# Missing Value by Percentage
df.isnull().sum() / len(df) * 100
Out[ ]:
Timestamp                  0.0000
Source IP Address          0.0000
Destination IP Address     0.0000
Source Port                0.0000
Destination Port           0.0000
Protocol                   0.0000
Packet Length              0.0000
Packet Type                0.0000
Traffic Type               0.0000
Payload Data               0.0000
Malware Indicators        50.0000
Anomaly Scores             0.0000
Alerts/Warnings           50.1675
Attack Type                0.0000
Attack Signature           0.0000
Action Taken               0.0000
Severity Level             0.0000
User Information           0.0000
Device Information         0.0000
Network Segment            0.0000
Geo-location Data          0.0000
Proxy Information         49.6275
Firewall Logs             49.9025
IDS/IPS Alerts            50.1250
Log Source                 0.0000
dtype: float64

Missing Values in Cybersecurity Dataset¶

We've identified significant missing values in several columns of our cybersecurity dataset, containing 40,000 rows and 25 columns. Here's a breakdown of the most concerning columns:

  • Alerts/Warnings: 50.17% (20,067 missing values)
  • IDS/IPS Alerts: 50.13% (20,050 missing values)
  • Malware Indicators: 50.00% (20,000 missing values)
  • Firewall Logs: 49.90% (19,961 missing values)
  • Proxy Information: 49.63% (19,851 missing values)

As you can see, these columns have a substantial amount of missing data, potentially impacting our analysis. We'll address these missing values in the next steps to ensure robust and reliable results.
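The two checks above (counts and percentages) can also be combined into a single table. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper and `sample` is a tiny made-up frame that stands in for `df` so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values) so the example runs on its own
sample = pd.DataFrame({
    "Protocol": ["TCP", "UDP", "ICMP", "TCP"],
    "Proxy Information": ["150.9.97.135", np.nan, "114.133.48.179", np.nan],
    "Firewall Logs": ["Log Data", "Log Data", np.nan, "Log Data"],
})

def missing_summary(frame):
    """Return missing-value counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

print(missing_summary(sample))
```

Calling `missing_summary(df)` on the real dataset should list the same five columns as the breakdown above.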

First, let's address the missing values.¶

Missing values may cause errors in our subsequent analysis, so we handle them before going further. In each of the five affected columns, a missing entry is taken to mean that nothing was recorded, and it is replaced with an explicit label.

In [ ]:
# Convert Alerts/Warnings to a yes/no flag:
# 'yes' if the value is 'Alert Triggered', otherwise (including missing values) 'no'.
df['Alerts/Warnings'] = df['Alerts/Warnings'].apply(lambda x: 'yes' if x == 'Alert Triggered' else 'no')
In [ ]:
# If Malware Indicators is missing, label it 'No Detection'; otherwise keep the original value.
df['Malware Indicators'] = df['Malware Indicators'].apply(lambda x: 'No Detection' if pd.isna(x) else x)
In [ ]:
#If Proxy Information is missing, it is assumed that there is no proxy
df['Proxy Information'] = df['Proxy Information'].apply(lambda x: 'No proxy' if pd.isna(x) else x)
In [ ]:
#If Firewall Logs is missing, it is assumed that there is no data
df['Firewall Logs'] = df['Firewall Logs'].apply(lambda x: 'No Data' if pd.isna(x) else x)
In [ ]:
# If IDS/IPS Alerts is missing, label it 'No Data', meaning no alert was generated by the IDS/IPS.
df['IDS/IPS Alerts'] = df['IDS/IPS Alerts'].apply(lambda x: 'No Data' if pd.isna(x) else x)
In [ ]:
df.isnull().sum().sort_values(ascending=False)
Out[ ]:
Timestamp                 0
Attack Type               0
IDS/IPS Alerts            0
Firewall Logs             0
Proxy Information         0
Geo-location Data         0
Network Segment           0
Device Information        0
User Information          0
Severity Level            0
Action Taken              0
Attack Signature          0
Alerts/Warnings           0
Source IP Address         0
Anomaly Scores            0
Malware Indicators        0
Payload Data              0
Traffic Type              0
Packet Type               0
Packet Length             0
Protocol                  0
Destination Port          0
Source Port               0
Destination IP Address    0
Log Source                0
dtype: int64

All missing values have now been replaced with explicit labels, so no nulls remain in the dataset.
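The same replacements can be written more compactly. The sketch below is an alternative, not the notebook's own code: it assumes the placeholder labels used in the cells above, and `sample` and `fill_labels` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample using the notebook's column names
sample = pd.DataFrame({
    "Alerts/Warnings": ["Alert Triggered", np.nan, "Alert Triggered"],
    "Malware Indicators": ["IoC Detected", np.nan, np.nan],
    "Proxy Information": [np.nan, "149.6.110.119", np.nan],
    "Firewall Logs": ["Log Data", np.nan, "Log Data"],
    "IDS/IPS Alerts": [np.nan, "Alert Data", np.nan],
})

# Placeholder label for each column, matching the cells above
fill_labels = {
    "Malware Indicators": "No Detection",
    "Proxy Information": "No proxy",
    "Firewall Logs": "No Data",
    "IDS/IPS Alerts": "No Data",
}
cleaned = sample.fillna(fill_labels)

# Alerts/Warnings becomes a yes/no flag instead of keeping its original text
cleaned["Alerts/Warnings"] = np.where(
    sample["Alerts/Warnings"] == "Alert Triggered", "yes", "no"
)

print(cleaned)
print(cleaned.isnull().sum().sum())
```

Passing a dictionary to `fillna` keeps the label for each column in one place, which makes the assumptions behind each replacement easier to review.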

Explore the Device Information Column¶

In [ ]:
df['Device Information'].value_counts()
Out[ ]:
Device Information
Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 6.2; Trident/3.0)                                                                                       35
Mozilla/5.0 (compatible; MSIE 5.0; Windows 98; Trident/4.1)                                                                                           34
Mozilla/5.0 (compatible; MSIE 6.0; Windows CE; Trident/4.0)                                                                                           33
Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/3.0)                                                                                       31
Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 5.2; Trident/4.1)                                                                                       31
                                                                                                                                                      ..
Mozilla/5.0 (Macintosh; PPC Mac OS X 10_9_2; rv:1.9.2.20) Gecko/6474-09-17 07:53:12 Firefox/3.6.9                                                      1
Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/535.0 (KHTML, like Gecko) CriOS/19.0.850.0 Mobile/88P921 Safari/535.0               1
Mozilla/5.0 (Windows NT 5.0; km-KH; rv:1.9.2.20) Gecko/7799-03-13 07:30:55 Firefox/3.8                                                                 1
Mozilla/5.0 (X11; Linux i686; rv:1.9.7.20) Gecko/6248-04-01 13:49:59 Firefox/3.8                                                                       1
Mozilla/5.0 (iPod; U; CPU iPhone OS 3_0 like Mac OS X; tg-TJ) AppleWebKit/534.33.5 (KHTML, like Gecko) Version/4.0.5 Mobile/8B116 Safari/6534.33.5     1
Name: count, Length: 32104, dtype: int64
In [ ]:
# Extract the browser label: the text before the first '/' in the user-agent string
df['Browser'] = df['Device Information'].str.split('/').str[0]
In [ ]:
# created the Browser column.
df['Browser']
Out[ ]:
0        Mozilla
1        Mozilla
2        Mozilla
3        Mozilla
4        Mozilla
          ...   
39995    Mozilla
39996    Mozilla
39997    Mozilla
39998    Mozilla
39999    Mozilla
Name: Browser, Length: 40000, dtype: object
In [ ]:
import re
# OS and device patterns to search for
patterns = [
    r'Windows',
    r'Linux',
    r'Android',
    r'iPad',
    r'iPod',
    r'iPhone',
    r'Macintosh',
]

def extract_device_or_os(user_agent):
    for pattern in patterns:
        match = re.search(pattern, user_agent, re.I)  # re.I makes the search case-insensitive
        if match:
            return match.group()
    return 'Unknown'  # Return 'Unknown' if no patterns match

# Extract device or OS
df['Device/OS'] = df['Device Information'].apply(extract_device_or_os)
In [ ]:
df['Browser'].value_counts()
Out[ ]:
Browser
Mozilla    31951
Opera       8049
Name: count, dtype: int64

The Browser column contains 31,951 Mozilla entries and 8,049 Opera entries, so Mozilla outnumbers Opera by roughly four to one. Because most browsers send a user agent that begins with Mozilla/5.0, this label reflects the leading token of the string rather than one specific browser.
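To see what the leading token does and does not capture, the sketch below applies the same split to three user-agent strings copied from the `value_counts()` output above and compares it with a search for more specific browser tokens. `BROWSER_TOKENS` and `browser_family` are hypothetical names and are not part of the notebook; the token list is an assumption and is far from exhaustive.

```python
import re

# User-agent strings copied from the value_counts() output above
user_agents = [
    "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 6.2; Trident/3.0)",
    "Mozilla/5.0 (X11; Linux i686; rv:1.9.7.20) Gecko/6248-04-01 13:49:59 Firefox/3.8",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_2 like Mac OS X) AppleWebKit/535.0 "
    "(KHTML, like Gecko) CriOS/19.0.850.0 Mobile/88P921 Safari/535.0",
]

# Hypothetical token list, checked in order. CriOS (Chrome on iOS) comes
# before Safari because such strings contain both tokens.
BROWSER_TOKENS = ["MSIE", "Firefox", "CriOS", "Chrome", "Safari", "Opera"]

def leading_token(user_agent):
    """The notebook's approach: the text before the first '/'."""
    return user_agent.split("/")[0]

def browser_family(user_agent):
    """Return the first known browser token found, else 'Unknown'."""
    for token in BROWSER_TOKENS:
        if re.search(re.escape(token), user_agent):
            return token
    return "Unknown"

for ua in user_agents:
    print(leading_token(ua), "->", browser_family(ua))
```

All three strings share the leading token Mozilla even though they name different browsers, which is worth keeping in mind when reading the Browser counts.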

In [ ]:
df['Device/OS'].value_counts()
Out[ ]:
Device/OS
Windows      17953
Linux         8840
Macintosh     5813
iPod          2656
Android       1620
iPhone        1567
iPad          1551
Name: count, dtype: int64

With 17,953 instances, Windows is the most common platform, followed by Linux (8,840) and Macintosh (5,813). Mobile devices appear far less often: iPod (2,656), Android (1,620), iPhone (1,567), and iPad (1,551).
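One detail of `extract_device_or_os` is that the patterns are tried in list order and the first match wins, so a string that mentions two platforms is labeled with whichever comes first in the list. The sketch below illustrates this with a made-up user agent. `SPECIFIC_FIRST` is a hypothetical reordering and not something the notebook uses, and the counts above were produced with the original order.

```python
import re

# The notebook's pattern order, and a hypothetical reordering that checks
# the more specific mobile names first
NOTEBOOK_ORDER = ["Windows", "Linux", "Android", "iPad", "iPod", "iPhone", "Macintosh"]
SPECIFIC_FIRST = ["Android", "iPad", "iPod", "iPhone", "Windows", "Macintosh", "Linux"]

def first_match(user_agent, patterns):
    """Return the first pattern found in the string, else 'Unknown'."""
    for pattern in patterns:
        match = re.search(pattern, user_agent, re.I)
        if match:
            return match.group()
    return "Unknown"

# Hypothetical user agent that mentions both Linux and Android
ua = ("Mozilla/5.0 (Linux; Android 4.4.1) AppleWebKit/535.0 "
      "(KHTML, like Gecko) Chrome/19.0.850.0 Safari/535.0")

print(first_match(ua, NOTEBOOK_ORDER))   # labeled by the earlier pattern, Linux
print(first_match(ua, SPECIFIC_FIRST))   # labeled Android
```

If any Android user agents in the dataset also contain the word Linux, they would be counted under Linux with the original order, so the Linux and Android totals are best read with that in mind.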

In [ ]:
#Dropping the Device Information Column
df = df.drop('Device Information', axis = 1)
In [ ]:
def extract_time_features(df, Timestamp):
    # Convert timestamp column to datetime if it's not already
    df[Timestamp] = pd.to_datetime(df[Timestamp])
    
    # Extract time features
    df['Year'] = df[Timestamp].dt.year
    df['Month'] = df[Timestamp].dt.month
    df['Day'] = df[Timestamp].dt.day
    df['Hour'] = df[Timestamp].dt.hour
    df['Minute'] = df[Timestamp].dt.minute
    df['Second'] = df[Timestamp].dt.second
    df['DayOfWeek'] = df[Timestamp].dt.dayofweek
    
    return df
In [ ]:
# extract_time_features adds the new columns to df in place and returns the same
# object, so new_df and df refer to the same DataFrame
new_df = extract_time_features(df, 'Timestamp')

# Check if new columns are created
print(new_df.head())
            Timestamp Source IP Address Destination IP Address  Source Port   
0 2023-05-30 06:33:58     103.216.15.12           84.9.164.252        31225  \
1 2020-08-26 07:08:30    78.199.217.198         66.191.137.154        17245   
2 2022-11-13 08:23:25      63.79.210.48          198.219.82.17        16811   
3 2023-07-02 10:38:46     163.42.196.10        101.228.192.255        20018   
4 2023-07-16 13:11:07     71.166.185.76        189.243.174.238         6131   

   Destination Port Protocol  Packet Length Packet Type Traffic Type   
0             17616     ICMP            503        Data         HTTP  \
1             48166     ICMP           1174        Data         HTTP   
2             53600      UDP            306     Control         HTTP   
3             32534      UDP            385        Data         HTTP   
4             26646      TCP           1462        Data          DNS   

                                        Payload Data  ... Log Source  Browser   
0  Qui natus odio asperiores nam. Optio nobis ius...  ...     Server  Mozilla  \
1  Aperiam quos modi officiis veritatis rem. Omni...  ...   Firewall  Mozilla   
2  Perferendis sapiente vitae soluta. Hic delectu...  ...   Firewall  Mozilla   
3  Totam maxime beatae expedita explicabo porro l...  ...   Firewall  Mozilla   
4  Odit nesciunt dolorem nisi iste iusto. Animi v...  ...   Firewall  Mozilla   

   Device/OS  Year Month Day Hour Minute Second DayOfWeek  
0    Windows  2023     5  30    6     33     58         1  
1    Windows  2020     8  26    7      8     30         2  
2    Windows  2022    11  13    8     23     25         6  
3  Macintosh  2023     7   2   10     38     46         6  
4    Windows  2023     7  16   13     11      7         6  

[5 rows x 33 columns]
In [ ]:
df.head(5)
Out[ ]:
Timestamp Source IP Address Destination IP Address Source Port Destination Port Protocol Packet Length Packet Type Traffic Type Payload Data ... Log Source Browser Device/OS Year Month Day Hour Minute Second DayOfWeek
0 2023-05-30 06:33:58 103.216.15.12 84.9.164.252 31225 17616 ICMP 503 Data HTTP Qui natus odio asperiores nam. Optio nobis ius... ... Server Mozilla Windows 2023 5 30 6 33 58 1
1 2020-08-26 07:08:30 78.199.217.198 66.191.137.154 17245 48166 ICMP 1174 Data HTTP Aperiam quos modi officiis veritatis rem. Omni... ... Firewall Mozilla Windows 2020 8 26 7 8 30 2
2 2022-11-13 08:23:25 63.79.210.48 198.219.82.17 16811 53600 UDP 306 Control HTTP Perferendis sapiente vitae soluta. Hic delectu... ... Firewall Mozilla Windows 2022 11 13 8 23 25 6
3 2023-07-02 10:38:46 163.42.196.10 101.228.192.255 20018 32534 UDP 385 Data HTTP Totam maxime beatae expedita explicabo porro l... ... Firewall Mozilla Macintosh 2023 7 2 10 38 46 6
4 2023-07-16 13:11:07 71.166.185.76 189.243.174.238 6131 26646 TCP 1462 Data DNS Odit nesciunt dolorem nisi iste iusto. Animi v... ... Firewall Mozilla Windows 2023 7 16 13 11 7 6

5 rows × 33 columns
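Because `extract_time_features` modifies the frame it receives, `df` itself gains the new columns, as the output above shows. Where the original frame should stay untouched, a copying variant could be used instead. The sketch below is one possible version: `add_time_features` is a hypothetical name, and it extracts only a subset of the fields for brevity.

```python
import pandas as pd

def add_time_features(frame, column):
    """Return a copy of frame with calendar fields derived from column."""
    out = frame.copy()
    stamps = pd.to_datetime(out[column])
    out[column] = stamps
    out["Year"] = stamps.dt.year
    out["Month"] = stamps.dt.month
    out["Day"] = stamps.dt.day
    out["Hour"] = stamps.dt.hour
    out["DayOfWeek"] = stamps.dt.dayofweek  # Monday=0 ... Sunday=6
    return out

# Two timestamps taken from the df.head() output above
raw = pd.DataFrame({"Timestamp": ["2023-05-30 06:33:58", "2022-11-13 08:23:25"]})
features = add_time_features(raw, "Timestamp")

print(features[["Year", "Month", "Day", "Hour", "DayOfWeek"]])
print(raw.columns.tolist())  # the input frame is left unchanged
```

The `DayOfWeek` values follow the pandas convention in which Monday is 0 and Sunday is 6, which matches the 1 (Tuesday) and 6 (Sunday) shown for these two rows above.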

In [ ]:
df.describe(include = 'object')
Out[ ]:
Source IP Address Destination IP Address Protocol Packet Type Traffic Type Payload Data Malware Indicators Alerts/Warnings Attack Type Attack Signature ... Severity Level User Information Network Segment Geo-location Data Proxy Information Firewall Logs IDS/IPS Alerts Log Source Browser Device/OS
count 40000 40000 40000 40000 40000 40000 40000 40000 40000 40000 ... 40000 40000 40000 40000 40000 40000 40000 40000 40000 40000
unique 40000 40000 3 2 3 40000 2 2 3 2 ... 3 32389 3 8723 20149 2 2 2 2 7
top 103.216.15.12 84.9.164.252 ICMP Control DNS Qui natus odio asperiores nam. Optio nobis ius... IoC Detected no DDoS Known Pattern A ... Medium Ishaan Chaudhari Segment C Ghaziabad, Meghalaya No proxy Log Data No Data Firewall Mozilla Windows
freq 1 1 13429 20237 13376 1 20000 20067 13428 20076 ... 13435 6 13408 16 19851 20039 20050 20116 31951 17953

4 rows × 21 columns

In [ ]:
df.columns
Out[ ]:
Index(['Timestamp', 'Source IP Address', 'Destination IP Address',
       'Source Port', 'Destination Port', 'Protocol', 'Packet Length',
       'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
       'Anomaly Scores', 'Alerts/Warnings', 'Attack Type', 'Attack Signature',
       'Action Taken', 'Severity Level', 'User Information', 'Network Segment',
       'Geo-location Data', 'Proxy Information', 'Firewall Logs',
       'IDS/IPS Alerts', 'Log Source', 'Browser', 'Device/OS', 'Year', 'Month',
       'Day', 'Hour', 'Minute', 'Second', 'DayOfWeek'],
      dtype='object')

Data Visualization¶

In [ ]:
# Records per day of the month, split by malware indicator (Plotly histogram)
fig = px.histogram(
    df,
    x='Day',
    color='Malware Indicators',
    title='Number of Malware Attacks by Day',
    # Keys must match the values in the column for the colors to apply
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()

The histogram shows that the 9th day of the month had the highest number of attacks, 720 in total. Counts vary from day to day, which may point to patterns in when attacks occur.
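Figures read off a chart can be cross-checked with a direct count. The sketch below shows the idea on a small made-up frame; `sample` is hypothetical and only stands in for the `Day` and `Malware Indicators` columns of `df`. The same pattern applies to `Month` and `Year`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df, with only the two columns used here
sample = pd.DataFrame({
    "Day": [9, 9, 9, 14, 14, 27],
    "Malware Indicators": ["IoC Detected", "No Detection", "IoC Detected",
                           "No Detection", "IoC Detected", "No Detection"],
})

# Total records per day, then the day with the largest total
per_day = sample["Day"].value_counts()
peak_day = per_day.idxmax()
print(peak_day, per_day.max())

# Split each day's total by malware indicator, as the colored bars do
by_indicator = pd.crosstab(sample["Day"], sample["Malware Indicators"])
print(by_indicator)
```

Note that the bar height in the histogram is the total number of records for that day, while the colored segments correspond to the columns of the crosstab.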

In [ ]:
# Month Distribution

fig = px.histogram(
    df,
    x='Month',
    title='Month',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Pastel color sequence
)
fig.show()
In [ ]:
# Records per month, split by malware indicator (Plotly histogram)
fig = px.histogram(
    df,
    x='Month',
    color='Malware Indicators',
    title='Number of Malware Attacks by Month',
    # Keys must match the values in the column for the colors to apply
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()

The graph shows that August had the highest number of malware attacks, with 1,861 incidents, more than any other month. This suggests that risk was elevated in August.

In [ ]:
# Year Distribution

fig = px.histogram(
    df,
    x='Year',
    title='Year',
    color_discrete_sequence=px.colors.qualitative.Pastel  # You can specify any color code or name you prefer
)
fig.show()
In [ ]:
# Records per year, split by malware indicator (Plotly histogram)

fig = px.histogram(
    df,
    x='Year',
    color='Malware Indicators',
    title='Number of Malware Attacks by Year',
    # Keys must match the values in the column for the colors to apply
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()

The histogram shows that most malware attacks occurred between mid-2021 and mid-2022, the period with the highest frequency of incidents. This points to a marked increase in cyberattacks during that time.

In [ ]:
# Protocol distribution, split by malware indicator (Plotly histogram)

fig = px.histogram(
    df,
    x='Protocol',
    color='Malware Indicators',
    title='Number of Malware Attacks by Protocol',
    # Keys must match the values in the column for the colors to apply
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()

Network Protocol Descriptions (ICMP, UDP, TCP)¶

ICMP (Internet Control Message Protocol):

  • ICMP sends error messages and operational information about packet processing issues.
  • It is used for diagnostics, like ping and traceroute.
  • ICMP operates at the network layer (Layer 3) of the OSI model.

UDP (User Datagram Protocol):

  • UDP is a simple, connectionless protocol for sending packets with low overhead.
  • It operates at the transport layer (Layer 4) and is used for fast, efficient applications like streaming and gaming.

TCP (Transmission Control Protocol):

  • TCP is a connection-oriented protocol ensuring reliable, ordered data delivery with error-checking.
  • It operates at the transport layer (Layer 4) of the OSI model and is used for critical applications like web browsing and email.

Analyzing the Traffic Type¶

In [ ]:
# Traffic Distribution

fig = px.pie(
    df,
    names='Traffic Type',
    title='Traffic Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()

The pie chart shows a nearly even split across traffic types: DNS (33.4%), HTTP (33.4%), and FTP (33.2%). The three traffic categories appear in roughly equal measure.

In [ ]:
# Traffic Type distribution, split by malware indicator (Plotly histogram)

fig = px.histogram(
    df,
    x='Traffic Type',
    color='Malware Indicators',
    title='Number of Malware Attacks by Traffic Type',
    # Keys must match the values in the column for the colors to apply
    color_discrete_map={'No Detection': 'lightblue', 'IoC Detected': 'salmon'}
)
fig.show()

HTTP (Hypertext Transfer Protocol):

  • HTTP is used to transmit web pages and web content over the internet.
  • It operates at the application layer (Layer 7) of the OSI model.
  • HTTP is stateless, meaning each request-response interaction is independent.
  • It is the foundation for web browsing and accessing websites.

DNS (Domain Name System):

  • DNS translates domain names (like www.example.com) into IP addresses.
  • It operates at the application layer (Layer 7) of the OSI model.
  • DNS allows users to access websites using easy-to-remember names instead of numerical IP addresses.

FTP (File Transfer Protocol):

  • FTP transfers files between a client and a server on a network.
  • It operates at the application layer (Layer 7) of the OSI model.
  • Common uses of FTP include uploading website files to a server and sharing files between computers.

Analyzing the Attack Type¶

In [ ]:
# Attack Type Distribution
fig = px.pie(
    df,
    names='Attack Type',
    title='Attack Type Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()

The pie chart shows a nearly equal distribution of attack types: DDoS (33.6%), Malware (33.3%), and Intrusion (33.25%). Each type of attack occurs at a comparable frequency, indicating a balanced threat landscape across the three.

In [ ]:
# Attack Type distribution, split by Traffic Type (Plotly histogram)
fig = px.histogram(
    df,
    x='Attack Type',
    color='Traffic Type',
    title='Number of Attacks by Attack Type and Traffic Type',
    color_discrete_map={'DNS': 'lightblue', 'HTTP': 'salmon', 'FTP': 'lightgreen'}  # Choose any colors you like
)
fig.show()

Analyzing the Browser, Devices and Attack Types¶

In [ ]:
# Browser Distribution
fig = px.pie(
    df,
    names='Browser',
    title='Browser Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()

The "Browser Distribution" pie chart shows that Mozilla accounts for 79.9% of records and Opera for the remaining 20.1%, consistent with the counts of 31,951 and 8,049 found earlier. Mozilla is therefore by far the more common of the two browser labels.

In [ ]:
# Platform Distribution
fig = px.pie(
    df,
    names='Device/OS',
    title='Platform Distribution',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Choose any color sequence
)
fig.show()

The chart shows the share of each platform in the dataset. Windows is the largest at 44.9%, followed by Linux at 22.1% and Macintosh at 14.5%. The iPod, Android, iPhone, and iPad platforms each account for less than 7%.
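The percentages in a pie chart follow directly from the category counts, so they can be recomputed as a check. The sketch below uses the counts from the `value_counts()` outputs earlier in the notebook; `to_percent` is a hypothetical helper introduced for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() outputs earlier in the notebook
device_counts = pd.Series({
    "Windows": 17953, "Linux": 8840, "Macintosh": 5813, "iPod": 2656,
    "Android": 1620, "iPhone": 1567, "iPad": 1551,
})
browser_counts = pd.Series({"Mozilla": 31951, "Opera": 8049})

def to_percent(counts):
    """Convert category counts to percentage shares, rounded to one decimal."""
    return (counts / counts.sum() * 100).round(1)

device_shares = to_percent(device_counts)
browser_shares = to_percent(browser_counts)

print(device_shares)
print(browser_shares)
```

On the full dataset, `df['Device/OS'].value_counts(normalize=True)` gives the same shares as fractions without the intermediate counts.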

In [ ]:
# Platform Distribution with Bar Chart
fig = px.histogram(df, x='Device/OS', color='Browser', title='Platform Distribution')
fig.show()
In [ ]:
# Device/OS against Attack Type (Plotly histogram)
fig = px.histogram(df, x='Device/OS', color='Attack Type', title='Number of Attacks by Device/OS')
fig.show()
In [ ]:
# Browser against Attack Type (Plotly histogram)
fig = px.histogram(df, x='Browser', color='Attack Type', title='Number of Attacks by Browser')
fig.show()

Analyzing the Log Source and Action Taken¶

In [ ]:
# Log Source Distribution
fig = px.histogram(df, x='Log Source', title='Log Source')
fig.show()
In [ ]:
# Action Taken Distribution
fig = px.histogram(
    df,
    x='Action Taken',
    title='Action Taken',
    color_discrete_sequence=px.colors.qualitative.Pastel  # Pastel color sequence
)
fig.show()
In [ ]:
# Action Taken by Attack Type
fig = px.histogram(df, x='Action Taken', color='Attack Type', title='Action Taken by Attack Type')
fig.show()
In [ ]:
# Log Source by Attack Type
fig = px.histogram(df, x='Log Source', color='Attack Type', title='Log Source by Attack Type')
fig.show()

Packet Length Distribution for Various Attack Types¶

In [ ]:
import plotly.graph_objs as go

# Filter data for each attack type
malware_data = df[df['Attack Type'] == 'Malware']['Packet Length']
intrusion_data = df[df['Attack Type'] == 'Intrusion']['Packet Length']
ddos_data = df[df['Attack Type'] == 'DDoS']['Packet Length']

# Create histograms for each attack type
malware_histo = go.Histogram(x=malware_data, name='Malware', opacity=0.7)
intrusion_histo = go.Histogram(x=intrusion_data, name='Intrusion', opacity=0.7)
ddos_histo = go.Histogram(x=ddos_data, name='DDoS', opacity=0.7)

# Create layout
layout = go.Layout(title='Packet Length Distribution for Various Attack Types',
                   xaxis=dict(title='Packet Length'),
                   yaxis=dict(title='Frequency'))

# Create figure
fig = go.Figure(data=[malware_histo, intrusion_histo, ddos_histo], layout=layout)

# Show plot
fig.show()

The histogram shows the packet length distributions for Malware, Intrusion, and DDoS attacks. Malware has a definite peak, Intrusion shows greater variability, and DDoS shows clustered patterns, which gives some insight into attack characteristics for targeted security measures.
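Impressions taken from histogram shapes can be backed up with summary statistics per group. The sketch below shows one way to do that with `groupby` and `describe`. The data is randomly generated for illustration and does not reproduce the notebook's records, and `sample` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical data: random packet lengths, not the notebook's records
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Attack Type": rng.choice(["Malware", "Intrusion", "DDoS"], size=300),
    "Packet Length": rng.integers(64, 1501, size=300),
})

# Count, mean, spread, and quartiles of packet length for each attack type
summary = sample.groupby("Attack Type")["Packet Length"].describe()
print(summary[["count", "mean", "std", "50%"]].round(1))
```

Running the same `groupby` on `df` would show whether the mean, spread, and median of `Packet Length` actually differ between attack types or whether the three distributions are close to one another.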

Conclusion¶

In conclusion, the analysis of this 40,000-record dataset sheds light on significant patterns and vulnerabilities. It shows a diverse mix of browsers and devices, with Windows devices being the primary targets for potential cyber threats. The temporal analysis reveals noteworthy patterns, such as higher attack counts on specific days and months, which may indicate vulnerabilities or deliberate tactics by threat actors. Using these findings, organizations can allocate resources strategically, strengthen endpoint security, and improve incident response to bolster their defenses against evolving cyber threats. Acting proactively on these findings can improve the security of digital infrastructure and data assets.

Recommendations¶

The following are key recommendations to strengthen your organization's cybersecurity posture based on the data analysis:

Threat Intelligence Gathering:

  • Enhance Real-Time Threat Monitoring: Integrate automated threat intelligence systems to stay on top of the latest cyber threats and vulnerabilities. Collaborate with cybersecurity communities to gain broader insights and develop more effective threat mitigation strategies.
  • Contextual Analysis: Analyze threat data alongside external factors like global events or industry trends to anticipate and prepare for potential attacks.

Vulnerability Assessment and Risk Prioritization:

  • Regular Assessments: Conduct frequent vulnerability assessments to identify and address security weaknesses before they can be exploited.
  • Risk Management: Develop a risk prioritization strategy that considers both the potential impact and likelihood of various threats, allowing for focused and efficient resource allocation.

Security Awareness Training Development:

  • Comprehensive Training Programs: Design training programs tailored to different user roles within the organization to improve overall security awareness and response capabilities.
  • Phishing Simulations: Implement regular phishing simulations to test and reinforce employee readiness, keeping them alert to potential email-based attacks.

Endpoint Security Solutions Evaluation and Deployment:

  • Advanced Endpoint Protection: Deploy robust endpoint security solutions equipped with features like antivirus, anti-malware, and firewalls. Prioritize protecting the most commonly used devices, such as Windows systems.
  • Cross-Platform Security: Ensure adequate security measures are also in place for less common devices (e.g., iPads, iPods) to prevent them from becoming attacker entry points.

Cybersecurity Policy Development:

  • Clear Policies and Procedures: Develop and document comprehensive cybersecurity policies outlining roles, responsibilities, and standard operating procedures for security incidents.
  • Policy Review and Update: Regularly review and update policies to address emerging threats and changes in the business environment.

Incident Response Plan Documentation:

  • Detailed Incident Response Plans: Create detailed incident response plans covering detection, containment, eradication, and recovery processes.
  • Regular Drills: Conduct regular incident response drills to ensure all team members understand their roles and responsibilities during a cyber incident.

Security Governance Framework Implementation:

  • Governance Framework: Implement a robust security governance framework to ensure that cybersecurity efforts are aligned with business objectives and regulatory requirements.
  • Continuous Improvement: Establish a governance committee to oversee cybersecurity initiatives, ensuring continuous improvement and adaptation to new threats.